The dataset contains 925 observations and 21 attributes: 16 numeric and 5 categorical (character) variables.
An initial analysis revealed missing values in many variables; nine numeric variables are missing in more than 60% of rows (for example, Charge Transfer Resistance (Rct) in 85% of rows).

- In numeric variables, missing values were replaced with the column median.
- In categorical variables, missing values were replaced with the label "Unknown".

After this processing, all variables were complete and ready for further analysis.
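The two imputation rules can be illustrated on a toy data frame. This is a minimal base R sketch with made-up values (the column names `surface_area` and `electrolyte` are hypothetical, not taken from the dataset); it also shows a side effect that is visible in the `summary()` output later in this report for heavily imputed columns such as Specific Surface Area: when most values are missing, the quartiles collapse onto the median.

```r
# Toy data frame (hypothetical values, not taken from the dataset)
toy <- data.frame(
  surface_area = c(100, NA, NA, NA, 300, NA, NA, NA),
  electrolyte  = c("KOH", NA, "H2SO4", "KOH", NA, "KOH", "KOH", "KOH"),
  stringsAsFactors = FALSE
)

# Numeric columns: replace NA with the column median
num_cols <- names(toy)[sapply(toy, is.numeric)]
for (col in num_cols) {
  med <- median(toy[[col]], na.rm = TRUE)
  toy[[col]][is.na(toy[[col]])] <- med
}

# Character columns: replace NA with the label "Unknown"
chr_cols <- names(toy)[sapply(toy, is.character)]
for (col in chr_cols) {
  toy[[col]][is.na(toy[[col]])] <- "Unknown"
}

# With 6 of 8 values imputed, the 25%, 50% and 75% quantiles all equal 200
quantile(toy$surface_area, c(0.25, 0.5, 0.75))
```

Because the imputed value carries no information about the individual row, a column with a very high share of missing values contributes little signal after imputation.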
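The dataset is suitable for further regression or classification analysis: after imputation it is complete, and most numeric variables are only weakly correlated with one another (the strongest exception is C at% and O at%, with r ≈ -0.84), so it is well prepared for predictive modelling.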
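One way to check how strongly the numeric variables depend on each other is to list every pair whose absolute correlation exceeds a threshold. The sketch below uses simulated data and base R only; the threshold of 0.6 and the variable names `a`, `b`, `c` are assumptions chosen for illustration.

```r
set.seed(1)
# Simulated data (hypothetical): b is built to correlate negatively with a
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = -a + rnorm(200, sd = 0.3),
  c = rnorm(200)
)

cor_toy <- cor(toy)

# Keep each pair once (upper triangle) and filter by absolute correlation
idx <- which(upper.tri(cor_toy) & abs(cor_toy) > 0.6, arr.ind = TRUE)
strong_pairs <- data.frame(
  var1 = rownames(cor_toy)[idx[, "row"]],
  var2 = colnames(cor_toy)[idx[, "col"]],
  r    = cor_toy[idx]
)
strong_pairs
```

In the correlation matrix printed later in this report, the pairs above this threshold are C.at. and O.at. (-0.84), the lower and upper limits of the potential window (0.64), and the two resistances Rct and Rs (0.62).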
To predict the specific capacitance of electrode materials (Capacitance (F/g)), three regression models were trained: Linear Regression (LM), Random Forest (RF), and Gradient Boosting (GBM).
The dataset was cleaned and prepared for modelling.
Columns that could distort the models were removed, including the identifier (Ref.) and descriptive variables with many categories, such as Electrolyte Chemical Formula and Electrode Configuration.
Categorical columns with more than 50 unique values were also dropped, to avoid problems with models that cannot handle factors with many levels.
The remaining variables were converted to numeric or factor type, depending on the data type.
The data were split into a training set (80%) and a test set (20%).
Linear Regression (LM): The linear model gave the weakest fit, with the highest RMSE (461.7) and MAE (296.4) and a low R² (0.149). This suggests that the relationships between material features and capacitance are nonlinear.
Random Forest (RF): The best model in the analysis. It achieved the lowest RMSE (267.1) and MAE (148.7) and the highest R² (0.719), which indicates a good fit on the test set. The model captured nonlinear relationships and interactions between variables.
Gradient Boosting (GBM): Results fall between LM and RF, with RMSE = 349.8, MAE = 215.6, and R² = 0.514. The model may achieve better results with appropriate hyperparameter tuning.
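As a starting point for the tuning mentioned above, the number of trees can be chosen by cross-validation instead of being fixed in advance. The sketch below is an illustration on simulated data, not the configuration used in this report: the data, the 5 folds, and the values of `shrinkage` and `interaction.depth` are all assumptions.

```r
library(gbm)

set.seed(1)
# Simulated regression data (hypothetical) with a nonlinear signal
n <- 300
sim <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
sim$y <- 10 * sin(pi * sim$x1) + 5 * sim$x2^2 + rnorm(n)

# Fit with built-in 5-fold cross-validation
fit <- gbm(
  y ~ .,
  data = sim,
  distribution = "gaussian",
  n.trees = 1000,
  interaction.depth = 3,
  shrinkage = 0.05,
  n.minobsinnode = 10,
  cv.folds = 5,
  n.cores = 1,
  verbose = FALSE
)

# Number of trees that minimises the cross-validated error
best_trees <- gbm.perf(fit, method = "cv", plot.it = FALSE)

# Predict with the selected number of trees
pred <- predict(fit, newdata = sim, n.trees = best_trees)
```

A fuller search over `interaction.depth`, `shrinkage`, and `n.minobsinnode` could be run with `caret::train(method = "gbm")`, since caret is already among the packages loaded in this report.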
- Potential.Window..V.: the largest influence on the model, which suggests that the width of the operating potential window is key for the predictions.
- Electrolyte.Concentration..M.: the electrolyte concentration strongly affects the result.
- Upper/Lower Limit of Potential Window: the bounds of the potential window are important, which makes physicochemical sense.
- Ratio.of.ID.IG: this indicator of material structure also matters.

The remaining features were clearly less important, which suggests that model training could be restricted to the most influential variables without a substantial loss of predictive quality.
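The idea of training on the most influential variables only can be checked by refitting the model on the top-ranked predictors and comparing errors. The sketch below uses the built-in `mtcars` data as a stand-in for the modelling table and the random forest's own permutation importance (`%IncMSE`) rather than the DALEX importance used in this report; the cut-off `k = 4` is an arbitrary assumption.

```r
library(randomForest)

set.seed(1)
# mtcars (built-in dataset) stands in for the modelling table; mpg is the target
full_rf <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

# Permutation importance (%IncMSE), sorted from most to least important
imp_rf <- importance(full_rf, type = 1)
ranked <- rownames(imp_rf)[order(imp_rf[, 1], decreasing = TRUE)]

# Refit using only the top k predictors (k = 4 is an arbitrary choice)
k <- 4
top_vars <- ranked[seq_len(k)]
small_rf <- randomForest(reformulate(top_vars, response = "mpg"),
                         data = mtcars, ntree = 500)

# Compare the out-of-bag MSE of the full and the reduced model
c(full = tail(full_rf$mse, 1), reduced = tail(small_rf$mse, 1))
```

If the reduced model's error stays close to that of the full model, dropping the weaker predictors is unlikely to cost much predictive quality.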
Applying explainable artificial intelligence (XAI) methods with SHAP made it possible to assess the influence of individual features on the predicted capacitance. Of the three trained models, Random Forest proved to be the best tool for predicting material capacitance, in terms of both prediction quality and model stability. Linear Regression was not suitable because of the nonlinear relationships, while GBM may become competitive after hyperparameter tuning.
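To make concrete what a SHAP value measures, the sketch below computes exact Shapley values for a toy model with two features. It is a simplified illustration, assuming a single baseline point instead of the averaging over the data that SHAP implementations perform; the function `f`, the observation `x`, and the baseline `base` are hypothetical.

```r
# Toy model with an interaction term (hypothetical, for illustration only)
f <- function(x1, x2) 2 * x1 + 3 * x2 + x1 * x2

x    <- c(x1 = 1, x2 = 2)   # observation to explain
base <- c(x1 = 0, x2 = 0)   # baseline (reference) point

# Shapley value of x1: its marginal contribution averaged over both orderings
phi1 <- 0.5 * ((f(x[["x1"]], base[["x2"]]) - f(base[["x1"]], base[["x2"]])) +
               (f(x[["x1"]], x[["x2"]])    - f(base[["x1"]], x[["x2"]])))

# Shapley value of x2, computed the same way
phi2 <- 0.5 * ((f(base[["x1"]], x[["x2"]]) - f(base[["x1"]], base[["x2"]])) +
               (f(x[["x1"]], x[["x2"]])    - f(x[["x1"]], base[["x2"]])))

phi <- c(x1 = phi1, x2 = phi2)
phi

# The contributions add up to the gap between the prediction and the baseline
sum(phi) == f(x[["x1"]], x[["x2"]]) - f(base[["x1"]], base[["x2"]])
```

The interaction term `x1 * x2` is split evenly between the two features, which is why each contribution differs from the plain linear coefficient times the feature value.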
# List of packages
required_packages <- c(
"tidyverse",
"data.table",
"janitor",
"skimr",
"naniar",
"plotly",
"DT",
"knitr",
"rmarkdown",
"ggcorrplot",
"randomForest",
"iml",
"htmltools",
"caret",
"gbm",
"DALEX",
"DALEXtra"
)
# Set the CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))
# Install missing packages
to_install <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(to_install) > 0){ install.packages(to_install, dependencies = TRUE) }
# Load packages; invisible() suppresses the printed list of attached packages
invisible(lapply(required_packages, library, character.only = TRUE))
set.seed(12345)
# Load the data from CSV
df <- read_csv("data/data.csv")
# Preview the first 10 rows
head(df, 10)
## # A tibble: 10 × 21
## Ref. Limits of Potential …¹ Lower Limit of Poten…² Upper Limit of Poten…³
## <chr> <chr> <dbl> <dbl>
## 1 DOI: 10… 0 to 0.8 0 0.8
## 2 DOI: 10… 0 to 1 0 1
## 3 DOI: 10… 0 to 1 0 1
## 4 DOI: 10… 0 to 1 0 1
## 5 DOI: 10… 0 to 1 0 1
## 6 DOI: 10… 0 to 0.5 0 0.5
## 7 DOI: 10… -0.4 to 0.2 -0.4 0.2
## 8 DOI: 10… -0.4 to 0.2 -0.4 0.2
## 9 DOI: 10… -0.4 to 0.2 -0.4 0.2
## 10 DOI: 10… -0.4 to 0.2 -0.4 0.2
## # ℹ abbreviated names: ¹`Limits of Potential Window (V)`,
## # ²`Lower Limit of Potential Window (V)`,
## # ³`Upper Limit of Potential Window (V)`
## # ℹ 17 more variables: `Potential Window (V)` <dbl>,
## # `Current Density (A/g)` <dbl>, `Capacitance (F/g)` <dbl>,
## # `Specific Surface Area (m^2/g)` <dbl>,
## # `Charge Transfer Resistance (Rct) (ohm)` <dbl>, …
# Data structure
str(df)
## spc_tbl_ [925 × 21] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ Ref. : chr [1:925] "DOI: 10.1039/c7ta03093b" "DOI: 10.1039/c6ta10933k" "DOI: 10.1039/c6ta10933k" "DOI: 10.1039/c6ta10933k" ...
## $ Limits of Potential Window (V) : chr [1:925] "0 to 0.8" "0 to 1" "0 to 1" "0 to 1" ...
## $ Lower Limit of Potential Window (V) : num [1:925] 0 0 0 0 0 0 -0.4 -0.4 -0.4 -0.4 ...
## $ Upper Limit of Potential Window (V) : num [1:925] 0.8 1 1 1 1 0.5 0.2 0.2 0.2 0.2 ...
## $ Potential Window (V) : num [1:925] 0.8 1 1 1 1 0.5 0.6 0.6 0.6 0.6 ...
## $ Current Density (A/g) : num [1:925] 1 1 2 5 10 1 1 1 1 1 ...
## $ Capacitance (F/g) : num [1:925] 680 367 338 283 246 872 143 306 360 483 ...
## $ Specific Surface Area (m^2/g) : num [1:925] 186 537 537 537 537 ...
## $ Charge Transfer Resistance (Rct) (ohm) : num [1:925] NA 6.1 6.1 6.1 6.1 NA NA NA NA NA ...
## $ Equivalent Series Resistance (Rs) (ohm) : num [1:925] 7.7 1.95 1.95 1.95 1.95 0.8 NA NA NA NA ...
## $ Electrode Configuration : chr [1:925] "CNF/RGO/moOxNy" "sulfur-doped graphene foam (SGF)" "sulfur-doped graphene foam (SGF)" "sulfur-doped graphene foam (SGF)" ...
## $ Pore Size (nm) : num [1:925] NA NA NA NA NA NA NA NA NA NA ...
## $ Pore Volume (cm^3/g) : num [1:925] NA NA NA NA NA NA NA NA NA NA ...
## $ Ratio of ID/IG : num [1:925] 1.45 1.28 1.28 1.28 1.28 ...
## $ N at% : num [1:925] 2.1 0 0 0 0 NA NA NA NA NA ...
## $ C at% : num [1:925] NA 85.6 85.6 85.6 85.6 NA NA NA NA NA ...
## $ O at% : num [1:925] NA 9.1 9.1 9.1 9.1 NA NA NA NA NA ...
## $ Electrolyte Chemical Formula : chr [1:925] "H2SO4" "KOH" "KOH" "KOH" ...
## $ Electrolyte Ionic Conductivity : num [1:925] 7 6 6 6 6 6 NA NA NA NA ...
## $ Electrolyte Concentration (M) : num [1:925] 1 6 6 6 6 2 NA NA NA NA ...
## $ Cell Configuration (three/two electrode system): chr [1:925] "three-electrode system" "two-electrode system" "two-electrode system" "two-electrode system" ...
## - attr(*, "spec")=
## .. cols(
## .. Ref. = col_character(),
## .. `Limits of Potential Window (V)` = col_character(),
## .. `Lower Limit of Potential Window (V)` = col_double(),
## .. `Upper Limit of Potential Window (V)` = col_double(),
## .. `Potential Window (V)` = col_double(),
## .. `Current Density (A/g)` = col_double(),
## .. `Capacitance (F/g)` = col_double(),
## .. `Specific Surface Area (m^2/g)` = col_double(),
## .. `Charge Transfer Resistance (Rct) (ohm)` = col_double(),
## .. `Equivalent Series Resistance (Rs) (ohm)` = col_double(),
## .. `Electrode Configuration` = col_character(),
## .. `Pore Size (nm)` = col_double(),
## .. `Pore Volume (cm^3/g)` = col_double(),
## .. `Ratio of ID/IG` = col_double(),
## .. `N at%` = col_double(),
## .. `C at%` = col_double(),
## .. `O at%` = col_double(),
## .. `Electrolyte Chemical Formula` = col_character(),
## .. `Electrolyte Ionic Conductivity` = col_double(),
## .. `Electrolyte Concentration (M)` = col_double(),
## .. `Cell Configuration (three/two electrode system)` = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
# Basic statistics
skimr::skim(df)
| Name | df |
|---|---|
| Number of rows | 925 |
| Number of columns | 21 |
| Column type frequency: | |
| character | 5 |
| numeric | 16 |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| Ref. | 0 | 1.00 | 20 | 38 | 0 | 198 | 0 |
| Limits of Potential Window (V) | 4 | 1.00 | 6 | 13 | 0 | 63 | 0 |
| Electrode Configuration | 0 | 1.00 | 2 | 104 | 0 | 353 | 0 |
| Electrolyte Chemical Formula | 22 | 0.98 | 3 | 54 | 0 | 23 | 0 |
| Cell Configuration (three/two electrode system) | 14 | 0.98 | 20 | 22 | 0 | 2 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Lower Limit of Potential Window (V) | 4 | 1.00 | -0.23 | 0.37 | -1.10 | -0.30 | 0.00 | 0.00 | 0.20 | ▂▁▁▂▇ |
| Upper Limit of Potential Window (V) | 4 | 1.00 | 0.63 | 0.45 | -0.20 | 0.40 | 0.60 | 0.80 | 3.50 | ▇▇▁▁▁ |
| Potential Window (V) | 5 | 0.99 | 0.86 | 0.35 | 0.40 | 0.60 | 0.82 | 1.00 | 3.50 | ▇▁▁▁▁ |
| Current Density (A/g) | 16 | 0.98 | 5.86 | 13.35 | 0.05 | 1.00 | 2.00 | 5.00 | 200.00 | ▇▁▁▁▁ |
| Capacitance (F/g) | 17 | 0.98 | 415.50 | 447.53 | 1.40 | 148.60 | 260.25 | 509.85 | 3344.08 | ▇▁▁▁▁ |
| Specific Surface Area (m^2/g) | 572 | 0.38 | 417.44 | 546.58 | 8.90 | 57.00 | 159.97 | 546.00 | 2400.00 | ▇▂▁▁▁ |
| Charge Transfer Resistance (Rct) (ohm) | 786 | 0.15 | 3.05 | 4.61 | 0.08 | 0.67 | 1.54 | 3.24 | 24.20 | ▇▁▁▁▁ |
| Equivalent Series Resistance (Rs) (ohm) | 772 | 0.17 | 1.60 | 2.43 | 0.20 | 0.35 | 0.58 | 2.00 | 17.50 | ▇▁▁▁▁ |
| Pore Size (nm) | 769 | 0.17 | 8.62 | 8.10 | 0.53 | 3.04 | 4.34 | 13.62 | 44.13 | ▇▂▂▁▁ |
| Pore Volume (cm^3/g) | 729 | 0.21 | 0.49 | 0.59 | 0.02 | 0.17 | 0.22 | 0.51 | 2.35 | ▇▁▁▁▁ |
| Ratio of ID/IG | 596 | 0.36 | 1.12 | 0.43 | 0.12 | 0.94 | 1.05 | 1.17 | 2.90 | ▁▇▁▁▁ |
| N at% | 690 | 0.25 | 2.50 | 4.57 | 0.00 | 0.00 | 0.00 | 3.20 | 23.82 | ▇▁▁▁▁ |
| C at% | 699 | 0.24 | 66.52 | 28.66 | 1.40 | 37.32 | 81.00 | 85.57 | 98.10 | ▂▂▁▂▇ |
| O at% | 703 | 0.24 | 19.18 | 14.49 | 1.90 | 8.88 | 13.70 | 27.10 | 54.28 | ▇▆▁▂▂ |
| Electrolyte Ionic Conductivity | 99 | 0.89 | 5.81 | 1.39 | 1.00 | 6.00 | 6.00 | 7.00 | 8.00 | ▁▂▁▇▅ |
| Electrolyte Concentration (M) | 62 | 0.93 | 2.58 | 2.19 | 0.10 | 1.00 | 1.00 | 6.00 | 6.00 | ▇▂▁▁▅ |
# Check for missing data
naniar::miss_var_summary(df)
## # A tibble: 21 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 Charge Transfer Resistance (Rct) (ohm) 786 85.0
## 2 Equivalent Series Resistance (Rs) (ohm) 772 83.5
## 3 Pore Size (nm) 769 83.1
## 4 Pore Volume (cm^3/g) 729 78.8
## 5 O at% 703 76
## 6 C at% 699 75.6
## 7 N at% 690 74.6
## 8 Ratio of ID/IG 596 64.4
## 9 Specific Surface Area (m^2/g) 572 61.8
## 10 Electrolyte Ionic Conductivity 99 10.7
## # ℹ 11 more rows
# Display the data in an interactive table
DT::datatable(df, options = list(pageLength = 10))
n_before <- nrow(df)
# Number of rows before handling missing values
cat("Number of rows before handling missing values:", n_before, "\n")
## Number of rows before handling missing values: 925
# Imputation
# Missing values (NA) in numeric columns are replaced with the column median
# Note: this also imputes the target column Capacitance (F/g), which has 17 missing values
numeric_cols <- df %>% select(where(is.numeric)) %>% names()
df[numeric_cols] <- df[numeric_cols] %>%
  mutate(across(everything(), ~ ifelse(is.na(.), median(., na.rm = TRUE), .)))
# Imputation
# Missing values in character columns are replaced with the label "Unknown"
categorical_cols <- df %>% select(where(is.character)) %>% names()
df[categorical_cols] <- df[categorical_cols] %>%
  mutate(across(everything(), ~ ifelse(is.na(.), "Unknown", .)))
# Number of rows after handling missing values
n_after <- nrow(df)
cat("Number of rows after handling missing values:", n_after, "\n")
## Number of rows after handling missing values: 925
# Re-check for missing data
naniar::miss_var_summary(df)
## # A tibble: 21 × 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 Ref. 0 0
## 2 Limits of Potential Window (V) 0 0
## 3 Lower Limit of Potential Window (V) 0 0
## 4 Upper Limit of Potential Window (V) 0 0
## 5 Potential Window (V) 0 0
## 6 Current Density (A/g) 0 0
## 7 Capacitance (F/g) 0 0
## 8 Specific Surface Area (m^2/g) 0 0
## 9 Charge Transfer Resistance (Rct) (ohm) 0 0
## 10 Equivalent Series Resistance (Rs) (ohm) 0 0
## # ℹ 11 more rows
# Number of rows and columns
cat("Number of rows:", nrow(df), "\n")
## Number of rows: 925
cat("Number of columns:", ncol(df), "\n\n")
## Number of columns: 21
# Column types
sapply(df, class)
## Ref.
## "character"
## Limits of Potential Window (V)
## "character"
## Lower Limit of Potential Window (V)
## "numeric"
## Upper Limit of Potential Window (V)
## "numeric"
## Potential Window (V)
## "numeric"
## Current Density (A/g)
## "numeric"
## Capacitance (F/g)
## "numeric"
## Specific Surface Area (m^2/g)
## "numeric"
## Charge Transfer Resistance (Rct) (ohm)
## "numeric"
## Equivalent Series Resistance (Rs) (ohm)
## "numeric"
## Electrode Configuration
## "character"
## Pore Size (nm)
## "numeric"
## Pore Volume (cm^3/g)
## "numeric"
## Ratio of ID/IG
## "numeric"
## N at%
## "numeric"
## C at%
## "numeric"
## O at%
## "numeric"
## Electrolyte Chemical Formula
## "character"
## Electrolyte Ionic Conductivity
## "numeric"
## Electrolyte Concentration (M)
## "numeric"
## Cell Configuration (three/two electrode system)
## "character"
# Total number of missing values in the dataset
total_missing <- sum(is.na(df))
cat("\nTotal number of missing values:", total_missing, "\n")
##
## Total number of missing values: 0
# Basic statistics for numeric variables
df %>%
select(where(is.numeric)) %>%
summary()
## Lower Limit of Potential Window (V) Upper Limit of Potential Window (V)
## Min. :-1.1000 Min. :-0.2000
## 1st Qu.:-0.3000 1st Qu.: 0.4000
## Median : 0.0000 Median : 0.6000
## Mean :-0.2333 Mean : 0.6299
## 3rd Qu.: 0.0000 3rd Qu.: 0.8000
## Max. : 0.2000 Max. : 3.5000
## Potential Window (V) Current Density (A/g) Capacitance (F/g)
## Min. :0.4000 Min. : 0.05 Min. : 1.4
## 1st Qu.:0.6000 1st Qu.: 1.00 1st Qu.: 150.8
## Median :0.8250 Median : 2.00 Median : 260.2
## Mean :0.8632 Mean : 5.79 Mean : 412.6
## 3rd Qu.:1.0000 3rd Qu.: 5.00 3rd Qu.: 493.6
## Max. :3.5000 Max. :200.00 Max. :3344.1
## Specific Surface Area (m^2/g) Charge Transfer Resistance (Rct) (ohm)
## Min. : 8.896 Min. : 0.080
## 1st Qu.: 159.970 1st Qu.: 1.540
## Median : 159.970 Median : 1.540
## Mean : 258.225 Mean : 1.767
## 3rd Qu.: 159.970 3rd Qu.: 1.540
## Max. :2400.000 Max. :24.200
## Equivalent Series Resistance (Rs) (ohm) Pore Size (nm) Pore Volume (cm^3/g)
## Min. : 0.200 Min. : 0.530 Min. :0.020
## 1st Qu.: 0.580 1st Qu.: 4.337 1st Qu.:0.217
## Median : 0.580 Median : 4.337 Median :0.217
## Mean : 0.749 Mean : 5.059 Mean :0.274
## 3rd Qu.: 0.580 3rd Qu.: 4.337 3rd Qu.:0.217
## Max. :17.500 Max. :44.131 Max. :2.350
## Ratio of ID/IG N at% C at% O at%
## Min. :0.120 Min. : 0.000 Min. : 1.40 Min. : 1.90
## 1st Qu.:1.050 1st Qu.: 0.000 1st Qu.:81.00 1st Qu.:13.70
## Median :1.050 Median : 0.000 Median :81.00 Median :13.70
## Mean :1.075 Mean : 0.635 Mean :77.46 Mean :15.01
## 3rd Qu.:1.050 3rd Qu.: 0.000 3rd Qu.:81.00 3rd Qu.:13.70
## Max. :2.900 Max. :23.820 Max. :98.10 Max. :54.28
## Electrolyte Ionic Conductivity Electrolyte Concentration (M)
## Min. :1.000 Min. :0.10
## 1st Qu.:6.000 1st Qu.:1.00
## Median :6.000 Median :1.00
## Mean :5.827 Mean :2.47
## 3rd Qu.:7.000 3rd Qu.:6.00
## Max. :8.000 Max. :6.00
# Basic statistics for categorical variables (number of unique values)
df %>%
select(where(is.character)) %>%
summarise(across(everything(), n_distinct))
## # A tibble: 1 × 5
## Ref. Limits of Potential Wind…¹ Electrode Configurat…² Electrolyte Chemical…³
## <int> <int> <int> <int>
## 1 198 64 353 24
## # ℹ abbreviated names: ¹`Limits of Potential Window (V)`,
## # ²`Electrode Configuration`, ³`Electrolyte Chemical Formula`
## # ℹ 1 more variable: `Cell Configuration (three/two electrode system)` <int>
# Make column names syntactically valid (spaces and special characters become dots)
names(df) <- make.names(names(df), unique = TRUE)
# Distributions of numeric variables
numeric_cols <- df %>% select(where(is.numeric)) %>% names()
# Histograms for all numeric variables
# aes(x = !!sym(col)) replaces the deprecated aes_string()
for(col in numeric_cols){
  print(
    ggplot(df, aes(x = !!sym(col))) +
      geom_histogram(bins = 30, fill = "skyblue", color = "black") +
      ggtitle(paste("Histogram:", col)) +
      theme_minimal()
  )
}
# Analysis of categorical variables
categorical_cols <- df %>% select(where(is.character)) %>% names()
# Count observations per category and draw bar charts
for(col in categorical_cols){
  df_plot <- df %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%
    arrange(desc(count)) %>%
    slice(1:10) # only the 10 most frequent categories
  print(ggplot(df_plot, aes(x = reorder(!!sym(col), -count), y = count)) +
    geom_bar(stat="identity", fill="orange") +
    theme(axis.text.x = element_text(angle=45, hjust=1)) +
    ggtitle(col))
}
# Select numeric variables only
numeric_cols <- df %>% select(where(is.numeric))
# Correlation matrix
cor_matrix <- cor(numeric_cols, use = "complete.obs")
# Print the correlation matrix
print(cor_matrix)
## Lower.Limit.of.Potential.Window..V.
## Lower.Limit.of.Potential.Window..V. 1.000000000
## Upper.Limit.of.Potential.Window..V. 0.636855880
## Potential.Window..V. -0.210914288
## Current.Density..A.g. 0.041002069
## Capacitance..F.g. 0.175026659
## Specific.Surface.Area..m.2.g. -0.016369465
## Charge.Transfer.Resistance..Rct...ohm. -0.012130269
## Equivalent.Series.Resistance..Rs...ohm. -0.028998317
## Pore.Size..nm. -0.113404442
## Pore.Volume..cm.3.g. -0.131274670
## Ratio.of.ID.IG 0.009058662
## N.at. -0.077133297
## C.at. 0.065335253
## O.at. -0.067008076
## Electrolyte.Ionic.Conductivity 0.141687972
## Electrolyte.Concentration..M. -0.249125592
## Upper.Limit.of.Potential.Window..V.
## Lower.Limit.of.Potential.Window..V. 0.636855880
## Upper.Limit.of.Potential.Window..V. 1.000000000
## Potential.Window..V. 0.597237542
## Current.Density..A.g. 0.026326371
## Capacitance..F.g. -0.104234613
## Specific.Surface.Area..m.2.g. 0.309511241
## Charge.Transfer.Resistance..Rct...ohm. 0.040329580
## Equivalent.Series.Resistance..Rs...ohm. 0.067374940
## Pore.Size..nm. -0.130283272
## Pore.Volume..cm.3.g. 0.087160371
## Ratio.of.ID.IG -0.004703697
## N.at. -0.046614593
## C.at. 0.185045046
## O.at. -0.172797106
## Electrolyte.Ionic.Conductivity 0.153377595
## Electrolyte.Concentration..M. -0.299186922
## Potential.Window..V.
## Lower.Limit.of.Potential.Window..V. -0.2109142878
## Upper.Limit.of.Potential.Window..V. 0.5972375418
## Potential.Window..V. 1.0000000000
## Current.Density..A.g. -0.0082795262
## Capacitance..F.g. -0.3354407656
## Specific.Surface.Area..m.2.g. 0.4030512978
## Charge.Transfer.Resistance..Rct...ohm. 0.0448321197
## Equivalent.Series.Resistance..Rs...ohm. 0.1110206887
## Pore.Size..nm. 0.0144932035
## Pore.Volume..cm.3.g. 0.2464957169
## Ratio.of.ID.IG -0.0285656499
## N.at. 0.0007152932
## C.at. 0.1797146104
## O.at. -0.1598773666
## Electrolyte.Ionic.Conductivity -0.0041546037
## Electrolyte.Concentration..M. -0.1323827758
## Current.Density..A.g. Capacitance..F.g.
## Lower.Limit.of.Potential.Window..V. 0.041002069 0.17502666
## Upper.Limit.of.Potential.Window..V. 0.026326371 -0.10423461
## Potential.Window..V. -0.008279526 -0.33544077
## Current.Density..A.g. 1.000000000 -0.00175134
## Capacitance..F.g. -0.001751340 1.00000000
## Specific.Surface.Area..m.2.g. 0.096514933 -0.15780819
## Charge.Transfer.Resistance..Rct...ohm. -0.027069515 -0.05001734
## Equivalent.Series.Resistance..Rs...ohm. -0.019455664 -0.04975992
## Pore.Size..nm. -0.042645476 -0.07041585
## Pore.Volume..cm.3.g. 0.073188199 -0.07940088
## Ratio.of.ID.IG -0.010859217 0.09438993
## N.at. -0.012131673 0.03261990
## C.at. -0.011774074 -0.19205171
## O.at. 0.012434955 0.17037233
## Electrolyte.Ionic.Conductivity -0.002992870 0.11559131
## Electrolyte.Concentration..M. 0.050341088 0.05152365
## Specific.Surface.Area..m.2.g.
## Lower.Limit.of.Potential.Window..V. -0.01636946
## Upper.Limit.of.Potential.Window..V. 0.30951124
## Potential.Window..V. 0.40305130
## Current.Density..A.g. 0.09651493
## Capacitance..F.g. -0.15780819
## Specific.Surface.Area..m.2.g. 1.00000000
## Charge.Transfer.Resistance..Rct...ohm. -0.02232633
## Equivalent.Series.Resistance..Rs...ohm. 0.01437125
## Pore.Size..nm. -0.10335526
## Pore.Volume..cm.3.g. 0.41401673
## Ratio.of.ID.IG -0.08890996
## N.at. 0.02939694
## C.at. 0.15071684
## O.at. -0.16899969
## Electrolyte.Ionic.Conductivity 0.06983506
## Electrolyte.Concentration..M. 0.07799474
## Charge.Transfer.Resistance..Rct...ohm.
## Lower.Limit.of.Potential.Window..V. -0.01213027
## Upper.Limit.of.Potential.Window..V. 0.04032958
## Potential.Window..V. 0.04483212
## Current.Density..A.g. -0.02706952
## Capacitance..F.g. -0.05001734
## Specific.Surface.Area..m.2.g. -0.02232633
## Charge.Transfer.Resistance..Rct...ohm. 1.00000000
## Equivalent.Series.Resistance..Rs...ohm. 0.62099145
## Pore.Size..nm. -0.01896747
## Pore.Volume..cm.3.g. -0.03228780
## Ratio.of.ID.IG -0.08636411
## N.at. -0.03518525
## C.at. -0.18148363
## O.at. 0.21664933
## Electrolyte.Ionic.Conductivity 0.01907102
## Electrolyte.Concentration..M. -0.12958669
## Equivalent.Series.Resistance..Rs...ohm.
## Lower.Limit.of.Potential.Window..V. -0.028998317
## Upper.Limit.of.Potential.Window..V. 0.067374940
## Potential.Window..V. 0.111020689
## Current.Density..A.g. -0.019455664
## Capacitance..F.g. -0.049759923
## Specific.Surface.Area..m.2.g. 0.014371246
## Charge.Transfer.Resistance..Rct...ohm. 0.620991454
## Equivalent.Series.Resistance..Rs...ohm. 1.000000000
## Pore.Size..nm. -0.070564524
## Pore.Volume..cm.3.g. 0.009789129
## Ratio.of.ID.IG -0.028259607
## N.at. 0.016350591
## C.at. -0.140517581
## O.at. 0.171919382
## Electrolyte.Ionic.Conductivity 0.075633124
## Electrolyte.Concentration..M. -0.111779109
## Pore.Size..nm. Pore.Volume..cm.3.g.
## Lower.Limit.of.Potential.Window..V. -0.11340444 -0.131274670
## Upper.Limit.of.Potential.Window..V. -0.13028327 0.087160371
## Potential.Window..V. 0.01449320 0.246495717
## Current.Density..A.g. -0.04264548 0.073188199
## Capacitance..F.g. -0.07041585 -0.079400883
## Specific.Surface.Area..m.2.g. -0.10335526 0.414016728
## Charge.Transfer.Resistance..Rct...ohm. -0.01896747 -0.032287798
## Equivalent.Series.Resistance..Rs...ohm. -0.07056452 0.009789129
## Pore.Size..nm. 1.00000000 0.067159723
## Pore.Volume..cm.3.g. 0.06715972 1.000000000
## Ratio.of.ID.IG -0.05139208 -0.111463696
## N.at. -0.04089784 0.122741401
## C.at. -0.03700902 -0.015388519
## O.at. 0.02880870 -0.082024148
## Electrolyte.Ionic.Conductivity -0.28628436 0.054346262
## Electrolyte.Concentration..M. -0.05450738 0.178704738
## Ratio.of.ID.IG N.at.
## Lower.Limit.of.Potential.Window..V. 0.009058662 -0.0771332968
## Upper.Limit.of.Potential.Window..V. -0.004703697 -0.0466145930
## Potential.Window..V. -0.028565650 0.0007152932
## Current.Density..A.g. -0.010859217 -0.0121316735
## Capacitance..F.g. 0.094389925 0.0326198960
## Specific.Surface.Area..m.2.g. -0.088909964 0.0293969390
## Charge.Transfer.Resistance..Rct...ohm. -0.086364109 -0.0351852548
## Equivalent.Series.Resistance..Rs...ohm. -0.028259607 0.0163505909
## Pore.Size..nm. -0.051392076 -0.0408978392
## Pore.Volume..cm.3.g. -0.111463696 0.1227414010
## Ratio.of.ID.IG 1.000000000 0.0085168640
## N.at. 0.008516864 1.0000000000
## C.at. -0.001325426 -0.1674366821
## O.at. 0.027590028 0.0707956477
## Electrolyte.Ionic.Conductivity 0.036801186 0.0475892108
## Electrolyte.Concentration..M. 0.026797248 -0.0046887428
## C.at. O.at.
## Lower.Limit.of.Potential.Window..V. 0.065335253 -0.06700808
## Upper.Limit.of.Potential.Window..V. 0.185045046 -0.17279711
## Potential.Window..V. 0.179714610 -0.15987737
## Current.Density..A.g. -0.011774074 0.01243495
## Capacitance..F.g. -0.192051713 0.17037233
## Specific.Surface.Area..m.2.g. 0.150716835 -0.16899969
## Charge.Transfer.Resistance..Rct...ohm. -0.181483629 0.21664933
## Equivalent.Series.Resistance..Rs...ohm. -0.140517581 0.17191938
## Pore.Size..nm. -0.037009025 0.02880870
## Pore.Volume..cm.3.g. -0.015388519 -0.08202415
## Ratio.of.ID.IG -0.001325426 0.02759003
## N.at. -0.167436682 0.07079565
## C.at. 1.000000000 -0.84316049
## O.at. -0.843160486 1.00000000
## Electrolyte.Ionic.Conductivity 0.058544022 -0.07981020
## Electrolyte.Concentration..M. -0.038537601 -0.01798620
## Electrolyte.Ionic.Conductivity
## Lower.Limit.of.Potential.Window..V. 0.141687972
## Upper.Limit.of.Potential.Window..V. 0.153377595
## Potential.Window..V. -0.004154604
## Current.Density..A.g. -0.002992870
## Capacitance..F.g. 0.115591307
## Specific.Surface.Area..m.2.g. 0.069835063
## Charge.Transfer.Resistance..Rct...ohm. 0.019071023
## Equivalent.Series.Resistance..Rs...ohm. 0.075633124
## Pore.Size..nm. -0.286284361
## Pore.Volume..cm.3.g. 0.054346262
## Ratio.of.ID.IG 0.036801186
## N.at. 0.047589211
## C.at. 0.058544022
## O.at. -0.079810201
## Electrolyte.Ionic.Conductivity 1.000000000
## Electrolyte.Concentration..M. 0.100270829
## Electrolyte.Concentration..M.
## Lower.Limit.of.Potential.Window..V. -0.249125592
## Upper.Limit.of.Potential.Window..V. -0.299186922
## Potential.Window..V. -0.132382776
## Current.Density..A.g. 0.050341088
## Capacitance..F.g. 0.051523650
## Specific.Surface.Area..m.2.g. 0.077994737
## Charge.Transfer.Resistance..Rct...ohm. -0.129586693
## Equivalent.Series.Resistance..Rs...ohm. -0.111779109
## Pore.Size..nm. -0.054507379
## Pore.Volume..cm.3.g. 0.178704738
## Ratio.of.ID.IG 0.026797248
## N.at. -0.004688743
## C.at. -0.038537601
## O.at. -0.017986196
## Electrolyte.Ionic.Conductivity 0.100270829
## Electrolyte.Concentration..M. 1.000000000
# Correlation matrix heatmap
ggcorrplot::ggcorrplot(cor_matrix,
  hc.order = TRUE,
  type = "lower",
  lab = TRUE,
  lab_size = 3,
  method="square",
  colors = c("red", "white", "blue"),
  title="Correlation matrix of numeric variables")
# Numeric columns
numeric_cols <- df %>% select(where(is.numeric)) %>% names()
plots_numeric <- list()
for(col in numeric_cols){
  # aes(x = !!sym(col)) replaces the deprecated aes_string()
  p <- ggplot(df, aes(x = !!sym(col))) +
    geom_histogram(bins = 30, fill = "skyblue", color = "black") +
    ggtitle(paste("Histogram:", col)) +
    theme_minimal()
  plots_numeric[[col]] <- ggplotly(p)
}
# Display all interactive histograms of numeric variables
tagList(plots_numeric)
# Categorical columns
categorical_cols <- df %>% select(where(is.character)) %>% names()
plots_categorical <- list()
for(col in categorical_cols){
  df_plot <- df %>%
    group_by(across(all_of(col))) %>%
    summarise(count = n(), .groups = "drop") %>%
    arrange(desc(count)) %>%
    slice(1:10)
  p <- ggplot(df_plot, aes(x = !!sym(col), y = count)) +
    geom_bar(stat = "identity", fill = "orange") +
    ggtitle(col) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
  plots_categorical[[col]] <- ggplotly(p)
}
# Display all interactive bar charts of categorical variables
tagList(plots_categorical)
# ------------------------------
# 1. Column selection
# ------------------------------
# Drop character columns with more than 50 categories
too_many_levels <- df %>%
  select(where(is.character)) %>%
  summarise(across(everything(), ~n_distinct(.))) %>%
  pivot_longer(everything(), names_to = "col", values_to = "nlevels") %>%
  filter(nlevels > 50) %>%
  pull(col)
cols_to_remove <- c(
  "Ref.", # the column is named "Ref." (with a trailing dot), also after make.names()
  "Electrolyte.Chemical.Formula",
  "Electrode.Configuration",
  too_many_levels
)
target <- "Capacitance..F.g."
features <- setdiff(names(df), c(target, cols_to_remove))
# ------------------------------
# 2. Dataset preparation
# ------------------------------
df_model <- df %>%
select(all_of(c(target, features))) %>%
na.omit() %>%
mutate(across(where(is.character), as.factor))
# Train/test split
train_index <- createDataPartition(df_model[[target]], p = 0.8, list = FALSE)
train <- df_model[train_index, ]
test <- df_model[-train_index, ]
# ------------------------------
# 3. MODELS
# ------------------------------
# Model 1: Linear Regression
model_lm <- lm(as.formula(paste(target, "~ .")), data = train)
# Model 2: Random Forest
model_rf <- randomForest(
as.formula(paste(target, "~ .")),
data = train,
ntree = 500
)
# Model 3: Gradient Boosting (GBM)
model_gbm <- gbm(
formula = as.formula(paste(target, "~ .")),
distribution = "gaussian",
data = train,
n.trees = 1500,
interaction.depth = 4,
shrinkage = 0.01,
n.minobsinnode = 10,
verbose = FALSE
)
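# ------------------------------
# 4. Evaluation
# ------------------------------
# Returns RMSE, MAE and R2 on the test set; relies on the global variable `target`
evaluate_model <- function(model, test, type = "lm") {
  if (type == "gbm") {
    # gbm needs the number of trees to use for prediction
    pred <- predict(model, test, n.trees = 1500)
  } else {
    pred <- predict(model, test)
  }
  rmse <- sqrt(mean((test[[target]] - pred)^2))
  mae <- mean(abs(test[[target]] - pred))
  # R2 is computed here as the squared Pearson correlation of observed and predicted values
  r2 <- cor(test[[target]], pred)^2
  return(c(RMSE = rmse, MAE = mae, R2 = r2))
}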
results <- rbind(
LM = evaluate_model(model_lm, test, "lm"),
RF = evaluate_model(model_rf, test, "rf"),
GBM = evaluate_model(model_gbm, test, "gbm")
)
print(results)
## RMSE MAE R2
## LM 461.7339 296.4076 0.1490706
## RF 267.0724 148.7020 0.7185534
## GBM 349.7983 215.6324 0.5140385
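The `R2` column above is the squared Pearson correlation between observed and predicted values, which is not always equal to the coefficient of determination, 1 - SSE/SST. The sketch below uses made-up numbers to show the difference: predictions that are shifted by a constant still correlate perfectly with the observations, while the coefficient of determination penalises the shift.

```r
# Hypothetical observed and predicted values; predictions are biased upward by 100
obs  <- c(100, 200, 300, 400, 500)
pred <- obs + 100

rmse <- sqrt(mean((obs - pred)^2))
mae  <- mean(abs(obs - pred))

# Squared Pearson correlation: unaffected by constant shifts and rescaling
r2_cor <- cor(obs, pred)^2

# Coefficient of determination: 1 - SSE / SST
r2_det <- 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

c(RMSE = rmse, MAE = mae, R2_cor = r2_cor, R2_det = r2_det)
```

Here `R2_cor` is 1 while `R2_det` is 0.5, so reporting both alongside RMSE and MAE gives a fuller picture when models are compared.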
# ------------------------------
# 5. SHAP for the best model
# ------------------------------
best_model_name <- rownames(results)[which.min(results[, "RMSE"])]
cat("Best model:", best_model_name, "\n")
## Best model: RF
if (best_model_name == "LM") best_model <- model_lm
if (best_model_name == "RF") best_model <- model_rf
if (best_model_name == "GBM") best_model <- model_gbm
explainer <- explain(
best_model,
data = train[features],
y = train[[target]],
label = best_model_name
)
## Preparation of a new explainer is initiated
## -> model label : RF
## -> data : 742 rows 16 cols
## -> data : tibble converted into a data.frame
## -> target variable : 742 values
## -> predict function : yhat.randomForest will be used ( default )
## -> predicted values : No value for predict function target column. ( default )
## -> model_info : package randomForest , ver. 4.7.1.2 , task regression ( default )
## -> predicted values : numerical, min = 18.96545 , mean = 407.0127 , max = 2619.041
## -> residual function : difference between y and yhat ( default )
## -> residuals : numerical, min = -940.4142 , mean = -0.2768953 , max = 1194.834
## A new explainer has been created!
# Permutation-based feature importance
imp <- model_parts(explainer)
plot(imp)
# SHAP for a single observation
shap <- predict_parts(
explainer,
new_observation = test[1, ],
type = "shap"
)
plot(shap)